[Day 24] Your First Scrapy Spider

11th iThome Ironman Contest

DAY 25

AI & Data

Crawler in Hand, Data at My Command - 30 Days of Hands-On Scrapy Crawling, Part 25

Tags: 11th Ironman Contest, python, crawler, scrapy

Rex Chien

2019-10-09 10:02:05

Yesterday we ran the scrapy genspider ithome ithome.com command, which generated the spider file ithome.py with the following contents:

import scrapy

class IthomeSpider(scrapy.Spider):
    name = 'ithome'
    allowed_domains = ['ithome.com']
    start_urls = ['http://ithome.com/']

    def parse(self, response):
        pass

Every spider should subclass scrapy.Spider. The Scrapy CLI has already generated a few important attributes and methods, each explained below:

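name: the spider's name, which must be unique within the project.
allowed_domains: the list of domains this spider is allowed to crawl. A request is skipped when the host of its URL is neither a listed domain nor a subdomain of one.
start_urls: the list of URLs the spider fetches on startup. It is consumed by the start_requests() method of scrapy.Spider; alternatively, leave this attribute undefined and override start_requests() instead.
parse(response): the default callback for handling responses. A spider usually has one or more callbacks of this form, each handling a different kind of response.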
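
The allowed_domains rule can be pictured with a small stand-alone sketch. It uses only the standard library and is a simplified illustration of the "domain or subdomain" check, not Scrapy's actual implementation; the helper name is_allowed is hypothetical.

```python
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Return True if the URL's host is a listed domain or a subdomain of one.

    Hypothetical helper for illustration only; Scrapy's own offsite
    filtering is implemented differently.
    """
    host = (urlparse(url).hostname or '').lower()
    return any(
        host == domain or host.endswith('.' + domain)
        for domain in allowed_domains
    )


print(is_allowed('http://ithome.com/', ['ithome.com']))           # True: exact domain
print(is_allowed('https://www.ithome.com/news', ['ithome.com']))  # True: subdomain
print(is_allowed('https://example.org/', ['ithome.com']))         # False: unrelated domain
```

Note the leading dot in the suffix check: without it, a host such as notithome.com would wrongly count as a subdomain of ithome.com.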

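start_requests() must return an iterable of requests, and parse(response) must return an iterable of requests and/or scraped items. A callback that returns nothing, like the pass stub above, is also accepted.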

Running the Spider

Point start_urls at the technical articles listing, and have parse(response) save the HTML source it receives to a file. The modified ithome.py looks like this:

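import scrapy

class IthomeSpider(scrapy.Spider):
    name = 'ithome'
    # ithelp.ithome.com.tw is a subdomain of ithome.com.tw, not of ithome.com
    allowed_domains = ['ithome.com.tw']
    start_urls = ['https://ithelp.ithome.com.tw/articles?tab=tech']

    def parse(self, response):
        # response.body is raw bytes, so open the file in binary mode
        with open('ithome.html', 'wb') as f:
            f.write(response.body)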